Introduction

This exercise should give you a better understanding of how AOA estimation works, why it is important, and what you might want to consider when developing models that are applicable for their intended purpose.

Prediction task

The task is to perform a supervised land cover classification for Münster in Germany. The challenge: there are no reference data available…at least none for training; you’ll get some later for validation ;-) Instead, reference data from a few regions across Germany are provided for model training (thanks to my Master Landscape Ecology group at WWU!). This is a realistic scenario given the large variety of prediction maps, e.g. in the context of ecology, where predictions are usually made far beyond the training samples.

Data

To start with, let’s load and explore the data…

Raster data (predictor variables)

As predictor variables for spatial mapping of the area of Münster, you have a Sentinel-2 image available with some derived artificial channels (NDVI, standard deviation of NDVI in a 5x5 pixel neighborhood, latitude, longitude).

# load the required packages
library(raster)
library(caret)
library(CAST)
library(doParallel)

sen_ms <- stack("data/Sen_Muenster_sub.grd")

Let’s plot the rasterStack to get an idea of what the variables look like.

plot(sen_ms)

plotRGB(sen_ms,stretch="lin",r=3,g=2,b=1)

Reference data

As reference data we have digitized and labeled training polygons for some spatially distinct regions across Germany. Sentinel-2 data have been extracted for the training polygons, and each pixel covered by a polygon is used as a potential training data point. We start by splitting the data into validation (Münster) and training (other regions) data.

trainSites <- readRDS("data/data_combined_ll.RDS")
trainDat <- trainSites[trainSites$Region!="Muenster",]
validationDat <- trainSites[trainSites$Region=="Muenster",]
head(trainSites)
##        ID B02 B03 B04 B08      B05      B06      B07      B8A     B11     B12
## 333971  1 760 552 275 199 227.5625 233.4375 231.6875 197.5000 64.8750 41.5000
## 333867  1 764 542 263 184 220.3750 217.0625 224.3125 198.6875 66.9375 35.1875
## 333949  1 770 554 276 201 227.3125 240.7500 244.4375 199.5625 67.3125 44.8125
## 333924  1 767 548 273 213 224.9375 243.0000 244.0625 208.3125 67.0625 41.0000
## 333945  1 755 558 274 193 224.4375 230.7500 241.9375 202.8750 66.3125 39.0625
## 333903  1 771 547 264 188 221.5625 225.7500 230.6875 201.7500 64.7500 29.2500
##              NDVI  NDVI_3x3_sd  NDVI_5x5_sd Label ClassID   Region      Lat
## 333971 -0.1603376 0.0010242277 0.0006361532 Water      17 Freiburg 48.00635
## 333867 -0.1767338 0.0007645315 0.0003339755 Water      17 Freiburg 48.00635
## 333949 -0.1572327 0.0012592688 0.0007046568 Water      17 Freiburg 48.00635
## 333924 -0.1234568 0.0014633634 0.0007525483 Water      17 Freiburg 48.00635
## 333945 -0.1734475 0.0010654820 0.0004294102 Water      17 Freiburg 48.00635
## 333903 -0.1681416 0.0014776551 0.0006619341 Water      17 Freiburg 48.00635
##             Lon
## 333971 7.758222
## 333867 7.758222
## 333949 7.758222
## 333924 7.758222
## 333945 7.758222
## 333903 7.758222
#see unique regions in train set:
unique(trainDat$Region)
## [1] "Freiburg" "Jena"     "Luebeck"

Predictors and response

In order to speed things up, we will reduce the data for this example: from each training polygon, only 15% of the pixels will be used for model training.

set.seed(100)
trainids <- createDataPartition(trainDat$ID,list=FALSE,p=0.15)
trainDat <- trainDat[trainids,]
trainDat <- trainDat[complete.cases(trainDat),]

For model training we need to define the predictor and response variables. To start with, we simply use all layers of the raster stack as predictors, without further consideration. As response variable we use the “Label” column of the data frame.

predictors <- names(sen_ms)
response <- "Label"

Model training and validation

We then train a random forest algorithm to learn the relationships between the predictors and the land cover. As a first naive approach we use the default modelling strategy with a random cross-validation.

# train the model
ctrl_default <- trainControl(method="cv", number = 3, savePredictions = TRUE)
set.seed(100)
model <- train(trainDat[,predictors],
               trainDat[,response],
               method="rf",
               metric="Kappa",
               trControl=ctrl_default,
               importance=TRUE,
               ntree=50)
model
## Random Forest 
## 
## 4290 samples
##   10 predictor
##   10 classes: 'ConiferousForest', 'DeciduousForest', 'Grassland', 'Industrial', 'MixedForest', 'OpenSoil', 'PlantedFields', 'Settlement', 'Urban', 'Water' 
## 
## No pre-processing
## Resampling: Cross-Validated (3 fold) 
## Summary of sample sizes: 2859, 2862, 2859 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9792539  0.9766389
##    6    0.9860130  0.9842484
##   10    0.9801856  0.9776829
## 
## Kappa was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 6.

The performance looks great, apparently a nearly perfect prediction model (keep in mind that Kappa can take values up to 1 = perfect fit).

Model prediction

To produce the classification map, we apply the trained model to each pixel of the raster stack using the predict function.

prediction <- predict(sen_ms,model)

However, the result quite obviously doesn’t look very good compared to the RGB image. Most strikingly, the city of Münster is mainly predicted as “OpenSoil”. How is this possible, given that the cross-validation results indicated a nearly perfect model?

Can the model predict beyond training regions?

The reason is that we apply the model to make predictions for a new region, but the random CV does not estimate the performance for Münster (or any other unseen region). Because the data are heavily clustered in space (several data points come from the same training polygon; see the explanations in the recordings of the previous summer schools), random CV only answers the question of how well the model can predict for the same clusters that were used during training.
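To illustrate this effect, here is a toy example (not part of the exercise data; all names and numbers are made up): with clustered data, random CV reports a much more optimistic performance than leave-cluster-out CV, because random folds share clusters with the training data.

```r
library(caret)

set.seed(1)
# five "spatial" clusters; the predictor distribution shifts with the cluster,
# and the class label depends on the cluster, not directly on x
dat <- do.call(rbind, lapply(1:5, function(cl) {
  data.frame(cluster = cl,
             x = rnorm(40, mean = 2 * cl),
             y = factor(ifelse(cl %% 2 == 0, "A", "B")))
}))

# random 3-fold CV: every fold contains points from every cluster
m_random <- train(y ~ x, data = dat, method = "rf",
                  trControl = trainControl(method = "cv", number = 3))

# leave-cluster-out CV: each fold holds out one complete cluster
folds <- groupKFold(dat$cluster, k = 5)
m_cluster <- train(y ~ x, data = dat, method = "rf",
                   trControl = trainControl(method = "cv", index = folds))

# the random CV estimate is typically far more optimistic
max(m_random$results$Accuracy)
max(m_cluster$results$Accuracy)
```

Here random CV looks nearly perfect because the label can be inferred from x within known clusters, while leave-cluster-out CV fails badly because each held-out cluster lies outside the x range seen during training.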

To get a more suitable performance estimate, we re-train the model with spatial CV. Since we want to apply the model to a new region within Germany, an obvious way of splitting the data is to split by region: In each iteration of cross-validation, one region is left out and used for validation instead of training.

# train the model
indices <- CreateSpacetimeFolds(trainDat,spacevar = "Region",k=length(unique(trainDat$Region)))
ctrl_sp <- trainControl(method="cv", index = indices$index, savePredictions = TRUE)

set.seed(100)
model_sp <- train(trainDat[,predictors],
               trainDat[,response],
               method="rf",
               metric="Kappa",
               trControl=ctrl_sp,
               importance=TRUE,
               ntree=50)

Let’s compare the performance estimates.

boxplot(model$resample$Kappa, model_sp$resample$Kappa,
        names=c("random CV", "spatial CV"),ylab="Kappa Index")

Based on the spatial CV, we expect the performance for Münster to be considerably lower compared to the random CV (Kappa around 0.4 to 0.6).

Validation with external validation data of Münster

We see large differences between random and spatial CV. Recently, arguments have been raised against spatial CV for estimating map accuracy (https://www.sciencedirect.com/science/article/abs/pii/S0304380021002489?via%3Dihub). However, as outlined above, spatial CV is the only meaningful way to estimate the performance for clustered data if no probability sample is available. Let’s test whether the spatial CV estimates are reasonable: we now validate the predictions with the external validation data.

pred_muenster_valid <- predict(model_sp,validationDat)
head(pred_muenster_valid)
## [1] Water Water Water Water Water Water
## 10 Levels: ConiferousForest DeciduousForest Grassland ... Water
head(validationDat$Label)
## [1] "Water" "Water" "Water" "Water" "Water" "Water"
# compare predictions with the reference labels (caret::postResample)
postResample(pred_muenster_valid, factor(validationDat$Label))
##  Accuracy     Kappa 
## 0.6240541 0.5562922

It seems the external validation agrees well with the spatial CV estimate (and, as expected, not at all with the random CV estimate).

Area of Applicability

Technically, it was no problem to make predictions for Münster. But let’s assess whether we really should have done this. A trained model should only be applied to locations whose predictor properties are comparable to those of the training data. Read https://besjournals.onlinelibrary.wiley.com/doi/10.1111/2041-210X.13650 to get familiar with the idea. If the dissimilarity to the training data is larger than the dissimilarity observed within the training data, Meyer and Pebesma (2021) suggest that the model should not be applied to that location.

Calculating the AOA is quite time consuming. To speed it up a bit, we use parallelization.

cl <- makeCluster(4)
registerDoParallel(cl)
AOA <- aoa(sen_ms,model,cl=cl)
plot(AOA)

We see that the AOA raster contains only values of 0, which means that no pixel falls inside the AOA. We can now visualize the prediction for the AOA only.
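The masking code is not shown in the rendered output; one possible sketch (assuming, as in CAST, that the AOA object stores its 0/1 raster as AOA$AOA, with 0 = outside):

```r
# Mask out all pixels outside the AOA (coded 0) before plotting;
# since no pixel is inside the AOA here, the masked map is completely empty
prediction_masked <- mask(prediction, AOA$AOA, maskvalue = 0)
plot(prediction_masked)
```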

We see that no predictions are shown because the model is not considered applicable to any part of the area of Münster.

Percentage of AOA

As a starting point for further improvement, let’s calculate the percentage of the AOA for Münster.
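The rendered code for this step is not shown; a sketch that would produce the line below (again assuming the 0/1 AOA raster in AOA$AOA):

```r
# Share of pixels inside the AOA (1 = inside, 0 = outside)
perc_aoa <- round(mean(values(AOA$AOA) == 1, na.rm = TRUE) * 100)
paste0("Percentage of Münster that is within the AOA: ", perc_aoa, " %")
```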

## [1] "Percentage of Münster that is within the AOA: 0 %"

Problem to be solved

Obviously, the model was not applicable to any part of Münster. The prediction patterns confirm that the model cannot make any reliable predictions here. Now it’s up to you to solve this.

Sample solution

To find explanations for these patterns, let’s look at the variable importance of the model.

plot(varImp(model))

It’s interesting that latitude and longitude are so important here. However, we would not expect these to be meaningful predictors when predicting beyond the training regions (see discussions at previous summer schools). If this is the case, the solution is quite simple: removing these variables should improve the spatial model performance and should also enlarge the AOA, because it leads to a model that is less overfitted to the training data. So let’s perform a spatial variable selection that chooses the combination of predictor variables leading to the best leave-region-out model performance. This way we expect to train a model that is as generalized as possible, so that it can be applied to a new region.

#ffs
set.seed(100)
model_ffs <- ffs(trainDat[,predictors],
               trainDat[,response],
               method="rf",
               metric="Kappa",
               trControl=ctrl_sp,
               importance=TRUE,
               ntree=50)

plot_ffs(model_ffs)

plot_ffs(model_ffs,plotType = "selected")

Let’s also have a look at the estimated spatial cross-validation performance in comparison to the original model.

boxplot(model_sp$resample$Kappa,model_ffs$resample$Kappa,
        names=c("original model", "after variable selection"),ylab="Kappa Index")

Next, we make predictions for Münster.
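The prediction code is not shown in the rendered output; it is presumably analogous to the first prediction (the object name prediction_ffs is our assumption):

```r
# Predict land cover for Münster with the variable-selected model
prediction_ffs <- predict(sen_ms, model_ffs)
```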

Let’s then estimate the AOA of the new model.

cl <- makeCluster(4)
registerDoParallel(cl)
AOA_ffs <- aoa(sen_ms,model_ffs,cl=cl)
plot(AOA_ffs)

Let’s visualize the results:
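A sketch of a possible visualization (assuming the new prediction was stored as prediction_ffs and the 0/1 AOA raster as AOA_ffs$AOA):

```r
# Show the new prediction only where the model is considered applicable
plot(mask(prediction_ffs, AOA_ffs$AOA, maskvalue = 0))
```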

… and validate the predictions and calculate percentage within the AOA.
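The corresponding code is not shown in the rendered output; a sketch that would produce the two result lines below (assuming the 0/1 AOA raster in AOA_ffs$AOA and caret::postResample for the accuracy measures):

```r
# External validation of the ffs model with the Münster reference data
pred_ffs_valid <- predict(model_ffs, validationDat)
postResample(pred_ffs_valid, factor(validationDat$Label))

# Percentage of Münster inside the new AOA (1 = inside)
paste0("Percentage of Münster that is within the AOA: ",
       round(mean(values(AOA_ffs$AOA) == 1, na.rm = TRUE) * 100), " %")
```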

##  Accuracy     Kappa 
## 0.6875000 0.6417889
## [1] "Percentage of Münster that is within the AOA: 99 %"

Using spatial variable selection, we could develop a more generalizable model, which is reflected in:

  1. more meaningful prediction patterns

  2. higher spatial CV performance

  3. a larger AOA.